Goto

Collaborating Authors

 Quảng Ninh Province


PromptTailor: Multi-turn Intent-Aligned Prompt Synthesis for Lightweight LLMs

Xu, Yizhou, Davis, Janet

arXiv.org Artificial Intelligence

Lightweight language models remain attractive for on-device and privacy-sensitive applications, but their responses are highly sensitive to prompt quality. For open-ended generation, non-expert users often lack the knowledge or time to consistently craft high-quality prompts, leading them to rely on prompt optimization tools. However, a key challenge is ensuring the optimized prompts genuinely align with users' original intents and preferences. We introduce PromptTailor, a system for controllable prompt generation for open-ended text that improves model output quality by intent-aligned prompt synthesis. PromptTailor expands minimal user instructions into rich, domain-aware prompts while preserving the user's stated preferences. The system is a quantized Llama3-8B model fine-tuned with a lightweight LoRA adapter on 12,300 prompt-refinement dialogues spanning 41 everyday domains, distilled from three stronger LLMs. The adapter attaches to any Llama3-8B base, enabling edge deployment. In human and LLM-judge evaluations across multiple target models and optimization baselines, PromptTailor yields higher preference rates than chain-of-thought prompting and matches or surpasses state-of-the-art prompt optimization methods while requiring fewer model calls (e.g., 3 vs. 9). These results show that a compact student, guided by powerful teachers, can learn effective prompt-generation strategies that enhance response quality while maintaining alignment with user intent.


Systemic approach for modeling a generic smart grid

Amor, Sofiane Ben, Guerard, Guillaume, Levy, Loup-Noé

arXiv.org Artificial Intelligence

Smart grid technological advances present a recent class of complex interdisciplinary modeling and increasingly difficult simulation problems to solve using traditional computational methods. To simulate a smart grid requires a systemic approach to integrated modeling of power systems, energy markets, demand-side management, and much other resources and assets that are becoming part of the current paradigm of the power grid. This paper presents a backbone model of a smart grid to test alternative scenarios for the grid. This tool simulates disparate systems to validate assumptions before the human scale model. Thanks to a distributed optimization of subsystems, the production and consumption scheduling is achieved while maintaining flexibility and scalability.


A short methodological review on social robot navigation benchmarking

Chhetri, Pranup, Torrejon, Alejandro, Eslava, Sergio, Manso, Luis J.

arXiv.org Artificial Intelligence

Social Robot Navigation is the skill that allows robots to move efficiently in human-populated environments while ensuring safety, comfort, and trust. Unlike other areas of research, the scientific community has not yet achieved an agreement on how Social Robot Navigation should be benchmarked. This is notably important, as the lack of a de facto standard to benchmark Social Robot Navigation can hinder the progress of the field and may lead to contradicting conclusions. Motivated by this gap, we contribute with a short review focused exclusively on benchmarking trends in the period from January 2020 to July 2025. Of the 130 papers identified by our search using IEEE Xplore, we analysed the 85 papers that met the criteria of the review. This review addresses the metrics used in the literature for benchmarking purposes, the algorithms employed in such benchmarks, the use of human surveys for benchmarking, and how conclusions are drawn from the benchmarking results, when applicable.


Does quantization affect models' performance on long-context tasks?

Mekala, Anmol, Atmakuru, Anirudh, Song, Yixiao, Karpinska, Marzena, Iyyer, Mohit

arXiv.org Artificial Intelligence

Large language models (LLMs) now support context windows exceeding 128K tokens, but this comes with significant memory requirements and high inference latency. Quantization can mitigate these costs, but may degrade performance. In this work, we present the first systematic evaluation of quantized LLMs on tasks with long inputs (>64K tokens) and long-form outputs. Our evaluation spans 9.7K test examples, five quantization methods (FP8, GPTQ-int8, AWQ-int4, GPTQ-int4, BNB-nf4), and five models (Llama-3.1 8B and 70B; Qwen-2.5 7B, 32B, and 72B). We find that, on average, 8-bit quantization preserves accuracy (~0.8% drop), whereas 4-bit methods lead to substantial losses, especially for tasks involving long-context inputs (drops of up to 59%). This degradation tends to worsen when the input is in a language other than English. Crucially, the effects of quantization depend heavily on the quantization method, model, and task. For instance, while Qwen-2.5 72B remains robust under BNB-nf4, Llama-3.1 70B experiences a 32% performance drop on the same task. These findings highlight the importance of a careful, task-specific evaluation before deploying quantized LLMs, particularly in long-context scenarios and for languages other than English.


Public Evaluation on Potential Social Impacts of Fully Autonomous Cybernetic Avatars for Physical Support in Daily-Life Environments: Large-Scale Demonstration and Survey at Avatar Land

Hafi, Lotfi El, Onishi, Kazuma, Hasegawa, Shoichi, Oyama, Akira, Ishikawa, Tomochika, Osada, Masashi, Tornberg, Carl, Kado, Ryoma, Murata, Kento, Hashimoto, Saki, Villalobos, Sebastian Carrera, Taniguchi, Akira, Ricardez, Gustavo Alfonso Garcia, Hagiwara, Yoshinobu, Aoki, Tatsuya, Iwata, Kensuke, Horii, Takato, Horikawa, Yukiko, Miyashita, Takahiro, Taniguchi, Tadahiro, Ishiguro, Hiroshi

arXiv.org Artificial Intelligence

However, to ensure the robustness of exophora resolution during the daily-life support CA R&D zone demonstration, we preloaded a user's pose skeleton and transcribed instruction data prepared offline, and manually recorded object locations as the model inputs. In addition, instead of demonstrating active exploration with visitors, which is time-consuming by nature, a pre-recorded video of the exploration performed on the site of Avatar Land was shown on TV displays. B. Object Manipulation After moving near the target object inferred by the ex-ophora resolution model, the Fetch CA picks up the object and places it on a Kachaka CA for delivery to the user, as the daily-life support CA R&D zone demonstration was designed to emphasize multi-CA collaboration. Although the position of the object is assumed to be known, in reality, there may be errors in self-position estimation, or the object position may have changed since the exophora resolution. Therefore, additional object detection was conducted with Detic [17] before grasping the target object for confirmation. The detected results were then combined with depth information to estimate the position of the target object. The position for placing the object on the Kachaka CA also needed to be determined similarly, so AR markers were attached to the shelf to compute the relative positions between the two CAs.


Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges

Van Dinh, Nguyen, Dang, Thanh Chi, Nguyen, Luan Thanh, Van Nguyen, Kiet

arXiv.org Artificial Intelligence

Vietnamese, a low-resource language, is typically categorized into three primary dialect groups that belong to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Despite the existence of various speech recognition datasets, none of them has provided a fine-grained classification of the 63 dialects specific to individual provinces of Vietnam. To address this gap, we introduce Vietnamese Multi-Dialect (ViMD) dataset, a novel comprehensive dataset capturing the rich diversity of 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio, consisting of approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and simultaneously demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained models for two downstream tasks: (1) Dialect identification and (2) Speech recognition. The empirical results suggest two implications including the influence of geographical factors on dialects, and the constraints of current approaches in speech recognition tasks involving multi-dialect speech data. Our dataset is available for research purposes.


CON: Continual Object Navigation via Data-Free Inter-Agent Knowledge Transfer in Unseen and Unfamiliar Places

Terashima, Kouki, Iwata, Daiki, Tanaka, Kanji

arXiv.org Artificial Intelligence

This work explores the potential of brief inter-agent knowledge transfer (KT) to enhance the robotic object goal navigation (ON) in unseen and unfamiliar environments. Drawing on the analogy of human travelers acquiring local knowledge, we propose a framework in which a traveler robot (student) communicates with local robots (teachers) to obtain ON knowledge through minimal interactions. We frame this process as a data-free continual learning (CL) challenge, aiming to transfer knowledge from a black-box model (teacher) to a new model (student). In contrast to approaches like zero-shot ON using large language models (LLMs), which utilize inherently communication-friendly natural language for knowledge representation, the other two major ON approaches -- frontier-driven methods using object feature maps and learning-based ON using neural state-action maps -- present complex challenges where data-free KT remains largely uncharted. To address this gap, we propose a lightweight, plug-and-play KT module targeting non-cooperative black-box teachers in open-world settings. Using the universal assumption that every teacher robot has vision and mobility capabilities, we define state-action history as the primary knowledge base. Our formulation leads to the development of a query-based occupancy map that dynamically represents target object locations, serving as an effective and communication-friendly knowledge representation. We validate the effectiveness of our method through experiments conducted in the Habitat environment.


Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese

Doan, Khang T., Huynh, Bao G., Hoang, Dung T., Pham, Thuc D., Pham, Nhat H., Nguyen, Quan T. M., Vo, Bang Q., Hoang, Suong N.

arXiv.org Artificial Intelligence

In this report, we introduce Vintern-1B, a reliable 1-billion-parameters multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question-answering in Vietnamese context. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks like OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit into various on-device applications easily. Additionally, we have open-sourced several Vietnamese vision question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.


A Hybrid-Layered System for Image-Guided Navigation and Robot Assisted Spine Surgery

T, Suhail Ansari, Maik, Vivek, Naheem, Minhas, Ram, Keerthi, Lakshmanan, Manojkumar, Sivaprakasam, Mohanasankar

arXiv.org Artificial Intelligence

In response to the growing demand for precise and affordable solutions for Image-Guided Spine Surgery (IGSS), this paper presents a comprehensive development of a Robot-Assisted and Navigation-Guided IGSS System. The endeavor involves integrating cutting-edge technologies to attain the required surgical precision and limit user radiation exposure, thereby addressing the limitations of manual surgical methods. We propose an IGSS workflow and system architecture employing a hybrid-layered approach, combining modular and integrated system architectures in distinctive layers to develop an affordable system for seamless integration, scalability, and reconfigurability. We developed and integrated the system and extensively tested it on phantoms and cadavers. The proposed system's accuracy using navigation guidance is 1.020 mm, and robot assistance is 1.11 mm on phantoms. Observing a similar performance in cadaveric validation where 84% of screw placements were grade A, 10% were grade B using navigation guidance, 90% were grade A, and 10% were grade B using robot assistance as per the Gertzbein-Robbins scale, proving its efficacy for an IGSS. The evaluated performance is adequate for an IGSS and at par with the existing systems in literature and those commercially available. The user radiation is lower than in the literature, given that the system requires only an average of 3 C-Arm images per pedicle screw placement and verification


AAPMT: AGI Assessment Through Prompt and Metric Transformer

Huang, Benhao

arXiv.org Artificial Intelligence

The emergence of text-to-image models marks a significant milestone in the evolution of AI-generated images (AGIs), expanding their use in diverse domains like design, entertainment, and more. Despite these breakthroughs, the quality of AGIs often remains suboptimal, highlighting the need for effective evaluation methods. These methods are crucial for assessing the quality of images relative to their textual descriptions, and they must accurately mirror human perception. Substantial progress has been achieved in this domain, with innovative techniques such as BLIP and DBCNN contributing significantly. However, recent studies, including AGIQA-3K, reveal a notable discrepancy between current methods and state-of-the-art (SOTA) standards. This gap emphasizes the necessity for a more sophisticated and precise evaluation metric. In response, our objective is to develop a model that could give ratings for metrics, which focuses on parameters like perceptual quality, authenticity, and the correspondence between text and image, that more closely aligns with human perception. In our paper, we introduce a range of effective methods, including prompt designs and the Metric Transformer. The Metric Transformer is a novel structure inspired by the complex interrelationships among various AGI quality metrics. The code is available at https://github.com/huskydoge/CS3324-Digital-Image-Processing/tree/main/Assignment1